Concentric layout, a new scientific data layout for matrix data-set in Hadoop file system

Authors

  • Jun Wang
  • Lu Cheng
  • Lizhe Wang
Abstract

Due to the explosive growth in the size of scientific data sets, data-intensive computing and analysis are an emerging trend in computational science. In these applications, data pre-processing is widely adopted because it can optimize the data layout or format ahead of time to facilitate future data access. At the same time, current research shows the increasing popularity of the MapReduce framework for large-scale data processing. However, the data access patterns typically applied to scientific data sets are not directly supported by the current MapReduce framework, and this gap motivates us to support these scientific data access patterns in MapReduce. In our work, we study the data access patterns of matrix files and propose a new concentric data layout to facilitate matrix data access and analysis in the MapReduce framework. The concentric data layout preserves the dimensional property of the matrix at the chunk level: in contrast to the continuous data layout adopted by the current Hadoop framework, it stores the data from the same sub-matrix in a single chunk. This layout guarantees that the average performance of data access is optimal regardless of the access pattern, at the cost of reorganizing the data before it is analyzed or processed. Our experiments on a real-world halo-finding application show that the concentric data layout improves overall performance by up to 38%.
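To make the contrast concrete, the sketch below (a toy Java example, not the authors' implementation; the names rowMajorChunk and concentricChunk and the parameters n, s, chunkElems are illustrative assumptions) computes which chunk an element (i, j) of an n × n matrix falls into under a row-major continuous layout versus a concentric layout that packs each s × s sub-matrix into one chunk.

```java
/**
 * Minimal sketch contrasting two chunk layouts for an n x n matrix.
 * Not the paper's implementation; parameters are illustrative.
 */
public class LayoutSketch {

    /** Row-major: elements are linearized row by row, then cut into chunks. */
    static long rowMajorChunk(long i, long j, long n, long chunkElems) {
        long offset = i * n + j;     // linear offset of element (i, j)
        return offset / chunkElems;  // chunk holding that offset
    }

    /** Concentric: each s x s sub-matrix is stored whole in one chunk. */
    static long concentricChunk(long i, long j, long n, long s) {
        long tilesPerRow = n / s;    // sub-matrices per row of tiles
        return (i / s) * tilesPerRow + (j / s);
    }

    public static void main(String[] args) {
        long n = 8192, s = 1024, chunkElems = s * s; // one tile per chunk
        System.out.println(rowMajorChunk(0, 0, n, chunkElems));  // 0
        System.out.println(concentricChunk(4096, 4096, n, s));   // tile 36
    }
}
```

With these assumed parameters, each row-major chunk holds 128 complete rows, so reading one full column touches 64 chunks, while under the concentric layout the same column spans only n/s = 8 tiles; row reads are bounded symmetrically, which illustrates why a tiled layout can balance performance across access patterns.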

Similar resources

Data-intensive File Systems for Internet Services: A Rose by Any Other Name... (CMU-PDL-08-114)

Data-intensive distributed file systems are emerging as a key component of large-scale Internet services and cloud computing platforms. They are designed from the ground up and are tuned for specific application workloads. Leading examples, such as the Google File System, the Hadoop distributed file system (HDFS) and Amazon S3, are defining this new purpose-built paradigm. It is tempting to classify...


JackRabbit: Improved agility in elastic distributed storage

Distributed storage can and should be elastic, just like other aspects of cloud computing. When storage is provided via single-purpose storage devices or servers, elasticity is useful for reducing energy usage. For storage provided via multi-purpose servers, however, such elasticity is needed to provide the cloud infrastructure with the freedom to use those servers for other purposes, which may...


Adaptive Dynamic Data Placement Algorithm for Hadoop in Heterogeneous Environments

The Hadoop MapReduce framework is an important distributed processing model for large-scale data-intensive applications. The rack-aware data placement strategy of current Hadoop and the existing Hadoop distributed file system assumes a homogeneous cluster, in which every node has the same computing capacity and is assigned the same workload. Default Hadoop d...

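As a rough illustration of the heterogeneity problem described above, the Java sketch below (a hypothetical placeBlocks helper under assumed "computing ratio" inputs; it is not the algorithm this paper proposes) divides a file's blocks among nodes in proportion to capacity rather than uniformly.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy capacity-proportional block placement; all names are illustrative. */
public class ProportionalPlacement {

    /** Divide totalBlocks among nodes in proportion to their ratios. */
    static Map<String, Long> placeBlocks(Map<String, Double> ratios, long totalBlocks) {
        double sum = ratios.values().stream().mapToDouble(Double::doubleValue).sum();
        Map<String, Long> plan = new LinkedHashMap<>();
        long assigned = 0;
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            long share = (long) Math.floor(totalBlocks * e.getValue() / sum);
            plan.put(e.getKey(), share);
            assigned += share;
        }
        // Give any rounding leftovers to the fastest node.
        String fastest = ratios.entrySet().stream()
                .max(Map.Entry.comparingByValue()).get().getKey();
        plan.merge(fastest, totalBlocks - assigned, Long::sum);
        return plan;
    }

    public static void main(String[] args) {
        Map<String, Double> ratios = new LinkedHashMap<>();
        ratios.put("node-fast", 2.0);  // twice the throughput of node-slow
        ratios.put("node-mid",  1.5);
        ratios.put("node-slow", 1.0);
        System.out.println(placeBlocks(ratios, 900));
        // {node-fast=400, node-mid=300, node-slow=200}
    }
}
```

With the assumed ratios, 900 blocks split 400/300/200 instead of 300 each, so faster nodes hold proportionally more local data and finish their map tasks at roughly the same time as slower ones.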

Journal:
  • IJPEDS

Volume 28, Issue –

Pages –

Publication date: 2013